  • Looping to identify autosomal monosomies | regexm

    Hello all:

    I am trying to loop through a karyotype string to identify occurrences of "-1" through "-22", which correspond to autosomal monosomies. My loop code picks up -18 as a match for both -1 and -18. How do I restrict the match appropriately? Sorry, the data are not -dataex-able since many entries are really long strings.

    CLONE1:41,XY,del(5)(q33q35),-6,-7,dic(11;12)(p11.2;p13),dic(17;20)(p11.2;p11.2),-18,-22[15]CLONE2:42,idem,+8[4]CLONE3:82,idemx2[2]

    foreach num of numlist 1(1)22{
    gen loss_`num' = ustrregexm(kar, "-`num'") if !missing(kar)
    }

  • #2
    You might be able to do this with a negative lookahead assertion:

    Code:
    clear
    input strL kar
    "break-1break-2break-12break-18break-181break"
    "break1break2break12break18break181"
    "CLONE1:41,XY,del(5)(q33q35),-6,-7,dic(11;12)(p11.2;p13),dic(17;20)(p11.2;p11.2),-18,-22[15]CLONE2:42,idem,+8[4]CLONE3:82,idemx2[2]"
    end
    
    foreach num of numlist 1(1)22{
        gen loss_`num' = ustrregexm(kar, "-`num'(?![0-9]+)") if !missing(kar)
    }
    The new part of the pattern (?![0-9]+) is called a negative lookahead assertion. It just means that if one or more digits in the range 0-9 follow the given number, then the substring does not match the pattern. So if num equals `1` it doesn't match `18` because `1` is followed by another number `8`. Like any regular expression, whether or not this will work depends on the rules for the way the underlying string was constructed. I leave it to you to verify that this works correctly for your data.
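
    A quick check of the lookahead on literal strings may help. This is only a sketch, and the test strings are made up for illustration. Note that the + inside the assertion is not needed: a single following digit is enough to reject the match, so (?![0-9]) should behave the same as (?![0-9]+).

    Code:
    * "-1" followed by a digit: the lookahead rejects the match (displays 0)
    display ustrregexm("-18,", "-1(?![0-9])")
    * "-1" followed by a comma: match (displays 1)
    display ustrregexm("-1,-18", "-1(?![0-9])")
    * "-1" at the very end of the string: still a match (displays 1)
    display ustrregexm("-7,-1", "-1(?![0-9])")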
    Last edited by Daniel Schaefer; 23 Dec 2022, 22:17.



    • #3
      That code worked perfectly. Nice little thing I learnt. Here is what I ended up with, requiring a comma before the minus sign:
      foreach num of numlist 1(1)22{
      gen loss_`num' = ustrregexm(kar, ",-`num'(?![0-9]+)") if !missing(kar)
      }
      egen loss_tot = rowtotal(loss_1-loss_22) if !missing(kar)
      label variable loss_tot "Total number of unique autosomal monosomies across all clones"


      Earlier, I had tried counting all minus signs using -egen- with -noccur()- (from -egenmore-, SSC), but looking up the bare minus sign is dangerous because it also picks up ranges and other non-monosomy entries. I am including the code here in case somebody finds it useful for specific situations:

      // Count of autosomal monosomies looking up all minuses
      egen nmono = noccur(kar), string(-)
      replace nmono = . if missing(kar)
      replace nmono = nmono-1 if ustrregexm(kar,"-Y")
      replace nmono = nmono-1 if ustrregexm(kar,"-X")
      label variable nmono "Total autosomal monosomies-imperfect"
      notes nmono: Picks ranges like 2-9, -mar2, 47-50, -X and -Y etc across multiple clones. Imperfect but could work



      • #4
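        An alternative is the faster regexm(kar, "-`num'[^0-9]"). Timings from -timer list- (10 repetitions of 22 patterns on 10^6 observations):

        Code:
        timer list
           1:    779.39 /      220 =       3.5427 ustrregexm(kar, "-`num'(?![0-9]+)")
           2:    275.55 /      220 =       1.2525 regexm(kar, "-`num'[^0-9]")

        The benchmark that produced these timings:

        Code: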
        input strL kar
        
        "CLONE1:41,XY,del(5)(q33q35),-6,-7,dic(11;12)(p11.2;p13),dic(17;20)(p11.2;p11.2),-18,-22[15]CLONE2:42,idem,+8[4]CLONE3:82,idemx2[2]"
        end
        
        expand `=10^6'
        
        qui forvalues i=1/10 {
            
            foreach num of numlist 1(1)22 {
                
                noi di "." _cont
                        
                keep kar
                
                timer on 1
                gen byte loss_`num'_1 = ustrregexm(kar, "-`num'(?![0-9]+)") if !missing(kar)
                timer off 1
                
                timer on 2
                gen byte loss_`num'_2 = regexm(kar, "-`num'[^0-9]") if !missing(kar)
                timer off 2
                
                assert loss_`num'_1 == loss_`num'_2
            }
            
            noi di "`i' `c(current_time)'"  
        }
        
        timer list
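
        One caveat worth checking on real data: [^0-9] has to consume a character, so this pattern cannot match when the number is the last thing in the string, whereas the lookahead version can. That does not arise in the example above, where the string ends in a closing bracket, so the assert passes. A minimal sketch of the difference, using made-up test strings, with one possible workaround that also accepts the end of the string:

        Code:
        * number at the very end of the string: the character class fails (displays 0)
        display regexm("-7,-18", "-18[^0-9]")
        * the lookahead version matches there (displays 1)
        display ustrregexm("-7,-18", "-18(?![0-9])")
        * accepting either a non-digit or the end of the string (displays 1)
        display regexm("-7,-18", "-18([^0-9]|$)")
        * a longer number is still rejected (displays 0)
        display regexm("-7,-181", "-18([^0-9]|$)")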
        Last edited by Bjarte Aagnes; 25 Dec 2022, 09:37.



        • #5
          Glad to see the difference between regexm and ustrregexm. I had been using ustrregexm almost exclusively for a while. I am also just beginning to learn how to nest loops.
